import warnings
warnings.simplefilter("ignore", UserWarning)
warnings.filterwarnings("ignore")
import zipfile
from urllib.request import urlopen
import dask.bag as db
import dask.dataframe as dd
import pandas as pd
from dask.distributed import Client
import plotly.express as px
import numpy as np
import datetime
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.ticker as mtick
from matplotlib.pyplot import xticks
import matplotlib.dates as dates
matplotlib.style.use('seaborn')
import calplot
from folium.plugins import TimeSliderChoropleth
import folium
import branca.colormap as cm
import geopandas as gpd
import fiona
from shapely.geometry import Polygon, mapping
import shapefile
fiona.drvsupport.supported_drivers['KML'] = 'rw'
fiona.drvsupport.supported_drivers['LIBKML'] = 'rw'
# printing
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))
from IPython.display import HTML, display
import pprint
pp = pprint.PrettyPrinter(indent=4, width=100)
HTML('''
<style>
.output_png {
display: table-cell;
text-align: center;
vertical-align: middle;
}
</style>
<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>
''')

New York confirmed its first COVID-19 case on March 1, 2020. Since then, the pandemic has turned the local economy upside down and claimed thousands of lives. For the safety of its people, the government issued strict stay-at-home orders. For New York taxi drivers, this has endangered their livelihoods and families. In this paper, we ask how travel patterns and behaviors changed from 2019 to 2020 with the pandemic in place. To answer this question, we gathered data from the New York City (NYC) Taxi and Limousine Commission (TLC), pre-processed it, and conducted an exploratory data analysis (EDA) to provide descriptive analytics. The dataset considered in this study spans January 1, 2019 to June 30, 2020, covering up to Phase 2 of New York’s reopening plan. From our EDA, we found little change in the routes and travel times of New Yorkers; instead, we found that New Yorkers strictly complied with the stay-at-home measures enacted by the government, resulting in a drop in the volume of taxi transactions. People usually travelled alone in pre-pandemic times, and this tendency increased during the pandemic. We also found that cashless transactions were banned by legislators, and people are experiencing the impacts of this legislation. With the reopening of New York, we are now seeing a small increase in the volume of transactions, and this should help NYC taxi drivers.
China announced the first case of coronavirus on December 31, 2019 [1]. By March 1, 2020, New York City (NYC) had confirmed its first case [2]. Unlike other parts of the world affected by the virus, the government of New York City initially brushed off the crisis, with the NYC Health Commissioner even stating that the risk to New Yorkers was “low” [3]. A month later, the city had become the epicenter of the outbreak in the US, with more cases than China [4].
The government has worked double time to balance the safety of the people with the health of the economy. However, the impact of the pandemic persists to this day: unemployment has surged, businesses and restaurants have shut down, and lives have been lost [5].
In this paper, we zoom in to see how the pandemic has affected transactions made with New York taxi drivers and compare this to their figures before the pandemic. With New York City being the epicenter of the COVID-19 pandemic, how has travel behavior changed through the course of 2020 compared to pre-pandemic times?
To answer this question, we extracted and cleaned the NYC TLC data and conducted an exploratory data analysis (EDA) to answer various questions regarding the pandemic’s effect on mobility, the economy, and society. The entire process runs on a Dask cluster.
Data of trips taken by taxis and vehicles in New York City were retrieved from the NYC Taxi and Limousine Commission (TLC) Trip Record Data found in the Registry of Open Data on Amazon Web Services (AWS) S3 bucket [6].
Data available in the NYC TLC applicable for this research is only from January 1, 2019 to June 30, 2020. This corresponds to $9.36$ GB worth of data. After preprocessing, this corresponds to $80,810,133$ transactions in 2019 and $15,859,906$ transactions in 2020.
To uncover changes in mobility of New Yorkers due to COVID-19, a total of $96,670,039$ transactions were retrieved from AWS Registry of Open Data [6]. The general workflow for providing descriptive analytics as shown in Figure 1 involves the following steps:
Each step of the workflow will be discussed in the succeeding sections.
**Figure 1**. Workflow for descriptive analytics of NYC Taxi Trips.

We first set up our Dask scheduler and workers on Amazon Web Services’ Elastic Compute Cloud (AWS EC2), since we are dealing with $9.36$ GB of data and therefore need distributed computing. To run the succeeding code, we assigned Jojie as our client.
client = Client('72.44.51.212:8786')
client
After establishing the connections of our client, scheduler, and workers, we extracted NYC TLC data from the AWS Registry of Open Data [6]. The extracted data covers the period of January 2019 to June 2020 (the most recent data as of this study). To focus our efforts, the scope was trimmed down to Yellow Taxis which refers to the official taxicabs in New York City.
The following datasets were extracted to supplement our analysis:
- .shp file [7];
- .txt file [8]; and

These datasets helped us in creating maps and accurately identifying zones.
df_2019 = dd.read_csv('s3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv',
                      assume_missing=True,
                      storage_options={'anon': True})
df_2020 = dd.read_csv('s3://nyc-tlc/trip data/yellow_tripdata_2020-*.csv',
                      assume_missing=True,
                      storage_options={'anon': True})
%%html
<style>
table td, table th, table tr {text-align:left !important;}
</style>
We then cleaned the data by applying the following:
- tpep_pickup_datetime;

We arrive at the following variables, ready for use in the EDA. These variables are found in Table 1.
**Table 1**. Data description of variables used.
| Data Field | Description |
|---|---|
| PULocationID | ID of pick-up zone |
| DOLocationID | ID of drop-off zone |
| payment_type | Payment Type where 1:Credit card, 2:Cash, 3:No charge, 4:Dispute |
| passenger_count | Number of passengers |
| trip* | Trip counter (set to 1 per transaction) |
| year* | Year of transaction |
| month* | Month of transaction |
| day* | Day of transaction |
| hour* | Hour of transaction |
*derived field
Separate Dask dataframes were built for hourly, daily, and monthly aggregation using helper functions.
The conduct of the exploratory data analysis and its structure are found in the succeeding section.
def clean_data(df, year):
    """Return a processed data frame"""
    # Add month of transaction
    df['month'] = (df.tpep_pickup_datetime
                     .astype('M8[us]')
                     .dt.month)
    # Add day of transaction
    df['day'] = (df.tpep_pickup_datetime
                   .astype('M8[us]')
                   .dt.day)
    # Add year of transaction
    df['year'] = (df.tpep_pickup_datetime
                    .astype('M8[us]')
                    .dt.year)
    # Add hour of transaction
    df['hour'] = (df.tpep_pickup_datetime
                    .astype('M8[us]')
                    .dt.hour)
    # Add trip count per transaction
    df['trip'] = 1
    # Filter data
    df = df[(df['total_amount'] > 0) &
            (~df['passenger_count'].isna()) &
            (df['passenger_count'] > 0) &
            (df['PULocationID'] < 264) &
            (df['DOLocationID'] < 264) &
            (df['trip_distance'] > 0) &
            (df['year'] == year)]
    if year == 2020:
        df = df[df['month'] < 7]
    columns = ['month', 'day', 'year', 'hour', 'trip',
               'payment_type', 'passenger_count', 'PULocationID',
               'DOLocationID']
    return df[columns]
df_clean_2019 = clean_data(df_2019, 2019)
df_clean_2020 = clean_data(df_2020, 2020)
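To make the cleaning rules easier to check by hand, the following is a minimal pure-Python sketch of the same logic applied row by row. It is an illustration only, not the Dask implementation used in this study: the sample rows are made up, and `derive_fields` and `is_valid` are hypothetical helper names.

```python
from datetime import datetime

# Hypothetical sample rows; the values are made up for illustration only.
SAMPLE_ROWS = [
    {"tpep_pickup_datetime": "2019-03-14 08:15:00", "total_amount": 12.5,
     "passenger_count": 1, "PULocationID": 161, "DOLocationID": 236,
     "trip_distance": 1.8},
    # Non-positive fare: dropped by the filter
    {"tpep_pickup_datetime": "2019-03-14 09:00:00", "total_amount": -3.0,
     "passenger_count": 1, "PULocationID": 161, "DOLocationID": 236,
     "trip_distance": 0.9},
    # Missing passenger count and drop-off ID outside the kept range: dropped
    {"tpep_pickup_datetime": "2019-03-15 23:40:00", "total_amount": 20.0,
     "passenger_count": None, "PULocationID": 48, "DOLocationID": 264,
     "trip_distance": 3.2},
]


def derive_fields(row):
    """Return a copy of row with month, day, year, hour, and trip added."""
    ts = datetime.strptime(row["tpep_pickup_datetime"], "%Y-%m-%d %H:%M:%S")
    return {**row, "month": ts.month, "day": ts.day, "year": ts.year,
            "hour": ts.hour, "trip": 1}


def is_valid(row, year):
    """Apply the same conditions as the clean_data filter to a single row."""
    return (row["total_amount"] > 0
            and row["passenger_count"] is not None
            and row["passenger_count"] > 0
            and row["PULocationID"] < 264
            and row["DOLocationID"] < 264
            and row["trip_distance"] > 0
            and row["year"] == year)


cleaned = [r for r in map(derive_fields, SAMPLE_ROWS) if is_valid(r, 2019)]
print(len(cleaned), "of", len(SAMPLE_ROWS), "sample rows kept")
```

Only the first sample row passes every condition; the other two illustrate the kinds of records that the filter removes.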
def agg_daily(df):
    """Return a persisted data frame aggregated daily"""
    return (df.groupby(['month', 'day', 'payment_type',
                        'passenger_count', 'PULocationID', 'DOLocationID'])
              .trip
              .count()
              .reset_index()
              .persist())
df_agg_daily_2020 = agg_daily(df_clean_2020)
df_agg_daily_2019 = agg_daily(df_clean_2019)
According to the NYC TLC, demand for the New York taxi went down by 90%, placing 83% of drivers in a tough spot wherein they have either struggled to afford food or could not afford food [10]. At the same time, COVID-19 cases have continuously increased, as shown in Figure 2. This has also led to nearly 40% of drivers either contracting COVID-19 or living with someone who tested positive for the virus [10]. With the drop in demand, what does this look like? By the numbers, how did COVID-19 affect mobility of New Yorkers? With the phased opening and decreasing case rates, are things looking better for NYC taxi drivers?
import json

path = ('https://raw.githubusercontent.com/fedhere/PUI2015_EC/master/'
        'mam1612_EC/nyc-zip-code-tabulation-areas-polygons.geojson')
with urlopen(path) as response:
    data = json.load(response)
df_covid = pd.read_csv('data-by-modzcta.csv')
fig = px.choropleth_mapbox(df_covid,
center = dict(lat=40.74, lon=-73.96),
geojson=data,
featureidkey='properties.postalCode',
locations='MODIFIED_ZCTA',
color='COVID_CASE_COUNT',
mapbox_style='carto-positron',
zoom=8.5,
color_continuous_scale='YlOrRd')
printmd("<a id='fig2'>**Figure 2**</a>. Total COVID-19 cases by zip code "
"as of November 26, 2020.")
fig.show(renderer='notebook')
In our exploratory data analysis, we answered the following questions:
def agg_monthly(df, year):
    """Return a pandas data frame aggregated monthly"""
    df = df.groupby('month')['trip'].sum().compute().reset_index()
    df['date'] = pd.to_datetime(year + '/' +
                                df.month.astype(str) + '/'
                                + '1')
    return df
# pd.concat is used because DataFrame.append was removed in pandas 2.0
viz = pd.concat([agg_monthly(df_agg_daily_2019, '2019'),
                 agg_monthly(df_agg_daily_2020, '2020')])
We provide a preliminary overview on the impact of coronavirus in Figure 3.
fig, ax = plt.subplots(figsize=(18,5))
ax.plot_date(viz['date'], viz['trip'], linestyle='-')
ax.get_yaxis().set_major_formatter(
matplotlib.ticker.FuncFormatter(lambda x,p: format(int(x/1000), ',')))
myFmt = mdates.DateFormatter('%b-%Y')
ax.xaxis.set_major_formatter(myFmt)
ax.set_xlabel('Month of Transaction')
ax.set_ylabel('Number of Trips (in thousands)')
printmd("<a id='fig3'>**Figure 3**</a>. Total monthly taxi transactions "
"from January 1, 2019 to June 30, 2020.")
plt.show();
Prior to the pandemic, the monthly number of transactions consistently ranged from $6$M to $8$M. January 2020, however, did not reach the monthly total of January 2019. Instead, it matched 2019’s lowest number of transactions, which occurred in July-August 2019. The January 2020 downtrend was sustained in February 2020, and the number of transactions plummeted in March 2020. These record lows continued through June 2020.
With this trend in mind, we continue to dissect this behavior in the following questions.
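The comparison between the same month in 2019 and 2020 can be expressed as a year-over-year percent change. The sketch below shows the arithmetic only; the monthly totals are hypothetical placeholders, not the figures from this study, and `pct_change` is an illustrative helper name.

```python
def pct_change(current, baseline):
    """Percent change of current relative to baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) * 100.0 / baseline


# Hypothetical monthly trip totals keyed by (year, month); NOT the study's data.
monthly_trips = {
    (2019, 1): 1000, (2020, 1): 800,
    (2019, 4): 1200, (2020, 4): 60,
}

# Year-over-year change for every month that has a 2020 value
yoy = {
    month: pct_change(monthly_trips[(2020, month)], monthly_trips[(2019, month)])
    for (year, month) in monthly_trips
    if year == 2020
}

for month, change in sorted(yoy.items()):
    print(f"month {month}: {change:+.1f}% vs. 2019")
```

Replacing the placeholder dictionary with the totals returned by `agg_monthly` would give the actual month-by-month change.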
To answer this question, we zoom in to 2020 daily patterns presented in Figure 4.
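# Note: this redefines agg_daily from the preprocessing step; this version
# takes the pre-aggregated Dask data frame and returns a pandas data frame.
def agg_daily(df, year):
    """Return a pandas dataframe aggregated daily"""
    df = df.groupby(['month', 'day'])['trip'].sum().compute().reset_index()
    df['date'] = pd.to_datetime(year + '/' +
                                df.month.astype(str) + '/'
                                + df.day.astype(str))
    return df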
viz = agg_daily(df_agg_daily_2020, '2020')
fig, ax = plt.subplots(figsize=(18,5))
ax.plot_date(viz['date'], viz['trip'], '-')
ax.get_yaxis().set_major_formatter(
matplotlib.ticker.FuncFormatter(lambda x,p: format(int(x/1000), ',')))
encircled_date = viz['date'][np.where(viz['date']=='2020-03-01')[0][0]]
val_of_encircled_date = viz['trip'][np.where(viz['date']=='2020-03-01')[0][0]]
ax.annotate(text="""
March 1, 2020
First case of COVID""",
xy=(encircled_date, 310_000))
ax.plot_date(encircled_date, val_of_encircled_date,
marker='o',
markerfacecolor='red', markersize=20,
alpha = 0.5)
ax.axvline(x=encircled_date, linestyle='--', color='red')
encircled_date = viz['date'][np.where(viz['date']=='2020-03-22')[0][0]]
val_of_encircled_date = viz['trip'][np.where(viz['date']=='2020-03-22')[0][0]]
ax.annotate(text="""
March 22, 2020
New York State on PAUSE""",
xy=(encircled_date, 310_000))
ax.plot_date(encircled_date, val_of_encircled_date,
marker='o',
markerfacecolor='red', markersize=20,
alpha = 0.5)
ax.axvline(x=encircled_date, linestyle='--', color='red')
myFmt = mdates.DateFormatter('%b-%d-%Y')
ax.xaxis.set_major_formatter(myFmt)
ax.set_xlabel('Day of Transaction')
ax.set_ylabel('Number of Trips (in thousands)')
ax.set_ylim(top=350_000)
printmd("<a id='fig4'>**Figure 4**</a>. Total daily taxi transactions "
        "from January 1, 2020 to June 30, 2020.")
plt.show();
We note that travel patterns persisted during the first few days of March, even though the first case in NYC was recorded on March 1, 2020. After March 1, the number of transactions even reached a peak. When Governor Andrew Cuomo issued stay-at-home orders on March 22 [11], demand for New York taxis plummeted. Upon further research, we found that the New York State government never issued a travel ban for taxis. Instead, the quick drop in transactions is a demand-driven impact of the stay-at-home orders on the general New York population. The period of March 1-21, 2020 also proved to be a critical window for the New York State government to act quickly and prevent the spread of the virus [3]. Despite the New York State on PAUSE order [11], it was too late; the New York State government announced a record high of 9,000 cases on April 7, quickly surpassing China’s number of cases [4].
We also briefly examined what could explain the cyclical highs and lows in the daily number of transactions. Figure 5 shows calendar plots of the daily number of transactions in 2019 and 2020.
# pd.concat is used because DataFrame.append was removed in pandas 2.0
viz = pd.concat([agg_daily(df_agg_daily_2019, '2019'),
                 agg_daily(df_agg_daily_2020, '2020')])
viz['trip'] = np.log(viz['trip'])
printmd("<a id='fig5'>**Figure 5**</a>. 2019 and 2020 calendar plots for "
"daily number of transactions.")
viz = agg_daily(df_agg_daily_2019, '2019')
fig, ax = calplot.calplot(viz[['date','trip']].set_index('date').squeeze(),
cmap='YlGn',fillcolor='grey',
linewidth=0.25, colorbar=False);
cbar = fig.colorbar(ax[0].get_children()[1], fraction=0.08,
ax=ax.ravel().tolist(), pad=0.1,
orientation='horizontal');
fig.set_size_inches(14,5)
# plt.tight_layout()
viz = agg_daily(df_agg_daily_2020, '2020')
fig, ax = calplot.calplot(viz[['date','trip']].set_index('date').squeeze(),
cmap='YlGn',fillcolor='grey',
linewidth=0.25, colorbar=False);
cbar = fig.colorbar(ax[0].get_children()[1], fraction=0.08,
ax=ax.ravel().tolist(), pad=0.1, orientation='horizontal');
fig.set_size_inches(14,5)
# plt.tight_layout()
We first examine the 2019 graph. We can infer that the bulk of taxi transactions happened during the middle of the week, as the darkest shades of green are found in these rows. In contrast, the lows we found in Figure 4 fall on Sundays and Mondays, at the end and the start of the week; light green areas are mostly found in these rows.
The scale drastically changed in 2020. The lightest shade of green starts in the month of March.
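One way to check the weekly pattern read off the calendar plots is to average the daily counts by day of the week. The following is a minimal sketch under that assumption; the two-week series is made up, and `mean_trips_by_weekday` is a hypothetical helper rather than part of the analysis code.

```python
from collections import defaultdict
from datetime import date, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def mean_trips_by_weekday(daily_trips):
    """Average trip count per weekday from a {date: count} mapping."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for day, trips in daily_trips.items():
        name = WEEKDAYS[day.weekday()]  # weekday() returns 0 for Monday
        totals[name] += trips
        counts[name] += 1
    return {name: totals[name] / counts[name] for name in totals}


# Hypothetical two-week series starting on Monday, January 7, 2019.
start = date(2019, 1, 7)
pattern = [90, 110, 120, 125, 115, 100, 80]  # made-up counts, Monday to Sunday
daily = {start + timedelta(days=i): pattern[i % 7] for i in range(14)}

weekday_means = mean_trips_by_weekday(daily)
busiest = max(weekday_means, key=weekday_means.get)
quietest = min(weekday_means, key=weekday_means.get)
print("busiest:", busiest, "| quietest:", quietest)
```

With the real daily totals, the same aggregation would show directly which weekdays carry the highs and lows seen in Figure 5.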
The Philippines has seen a spike in cashless transactions since the pandemic hit [12], and we expected the same boom in the United States. This was not the case. Despite the growth of cashless transactions to minimize contact [13], New York City has issued a bill to ban cashless transactions in order “to blunt the impact of advancing technology on those who are unable to use it because of financial circumstances, unwilling to for philosophical reasons or vulnerable to its darker aspects” [14], [15].
Figure 6 shows the monthly proportion of transactions paid in cash and cashless.
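The monthly proportions are obtained by dividing each payment category's count by the month's total. A small sketch of that normalization follows; the counts are hypothetical and `to_percent_shares` is an illustrative helper name, not a function from this study.

```python
def to_percent_shares(counts):
    """Convert a {category: count} mapping into percentages that sum to 100."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot compute shares of an empty month")
    return {category: value * 100.0 / total
            for category, value in counts.items()}


# Hypothetical monthly counts by payment category; NOT the study's data.
monthly_counts = {
    "2020-03": {"Cash": 30, "Cashless": 70},
    "2020-04": {"Cash": 1, "Cashless": 3},
}

shares = {month: to_percent_shares(counts)
          for month, counts in monthly_counts.items()}

for month, share in shares.items():
    print(month, {k: round(v, 1) for k, v in share.items()})
```

When such shares are passed to a stacked area plot, keeping the series and their labels in the same order avoids mislabeled bands.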
def agg_monthly_payment_type(df, year):
    """Returns a pandas data frame aggregated monthly by payment_type"""
    df = (df.groupby(['month', 'payment_type'])['trip']
            .sum().compute().reset_index())
    df['date'] = pd.to_datetime(year + '/' +
                                df.month.astype(str) + '/'
                                + '1')
    return df
# pd.concat is used because DataFrame.append was removed in pandas 2.0
viz = pd.concat([agg_monthly_payment_type(df_agg_daily_2019, '2019'),
                 agg_monthly_payment_type(df_agg_daily_2020, '2020')])
viz['payment_type'] = viz['payment_type'].apply(lambda x: 'Cash' if x==2
else 'Cashless')
viz = pd.pivot_table(viz[['trip', 'date', 'payment_type']],
columns='payment_type', values='trip',
aggfunc=sum, index='date')
data_perc = viz.divide(viz.sum(axis=1), axis=0)*100
fig, ax = plt.subplots(figsize=(18,5))
ax.stackplot(data_perc.index,
             [data_perc['Cashless'], data_perc['Cash']],
             labels=['Cashless', 'Cash'],  # label order must match series order
             alpha=0.85)
myFmt = mdates.DateFormatter('%b-%Y')
ax.xaxis.set_major_formatter(myFmt)
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
ax.set_xlabel('Month of Transaction')
ax.set_ylabel('Percent of Total Transaction')
legend = ax.legend(frameon = 1, fontsize='x-large')
frame = legend.get_frame()
frame.set_color('white')
frame.set_edgecolor('black')
printmd("<a id='fig6'>**Figure 6**</a>. Proportion of cash and cashless "
"transactions from January 1, 2019 to June 30, 2020.")
plt.show();
Cashless payment methods gained ground when the pandemic hit, but as the economy reopened, the proportion of cash transactions steadily increased. The effect of the bill is evident, as cash still reigns supreme in New York. It seems that “financial inclusion” in the time of the pandemic is defined differently in the US and the Philippines.
Gov. Andrew Cuomo has enacted the 10-point New York State on PAUSE to “assure uniform safety for everyone” [11]. In particular, the following guidelines greatly drove down taxi demand.
“4. When in public individuals must practice social distancing of at least six feet from others;”
“7. Individuals should limit use of public transportation to when absolutely necessary and should limit potential exposure by spacing out at least six feet from other riders;”
However, we asked if these guidelines were still followed in taxi trips. Figure 7 shows the proportion of transactions with single, couple, and group passengers.
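The single, couple, and group categories can be derived from the passenger count with a small bucketing rule. The sketch below assumes that any count of three or more is a group (the analysis code maps counts 3 through 9 to this category); `passenger_group` is a hypothetical helper name.

```python
def passenger_group(passenger_count):
    """Bucket a passenger count into 'Single', 'Couple', or 'Group'."""
    n = int(passenger_count)
    if n < 1:
        raise ValueError("passenger_count must be at least 1")
    if n == 1:
        return "Single"
    if n == 2:
        return "Couple"
    # Assumption: every count of three or more is treated as a group
    return "Group"


# Counts arrive as floats when the CSV is read with assume_missing=True
labels = [passenger_group(c) for c in (1.0, 2.0, 3.0, 6.0)]
print(labels)
```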
def agg_monthly_passenger_count(df, year):
    """Returns a pandas data frame aggregated monthly by passenger_count"""
    df = df.groupby(['month',
                     'passenger_count'])['trip'].sum().compute().reset_index()
    df['date'] = pd.to_datetime(year + '/' +
                                df.month.astype(str) + '/'
                                + '1')
    return df
# pd.concat is used because DataFrame.append was removed in pandas 2.0
viz = pd.concat([agg_monthly_passenger_count(df_agg_daily_2019, '2019'),
                 agg_monthly_passenger_count(df_agg_daily_2020, '2020')])
viz['passenger_count'] = viz['passenger_count'].replace({1:'Single',
2:'Couple',
3:'Group', 4:'Group',
5:'Group', 6:'Group',
7:'Group', 8:'Group',
9:'Group'})
viz = pd.pivot_table(viz[['trip', 'date', 'passenger_count']],
columns='passenger_count', values='trip',
aggfunc=sum, index='date')
data_perc = viz.divide(viz.sum(axis=1), axis=0)*100
fig, ax = plt.subplots(figsize=(18,5))
ax.stackplot(data_perc.index,
[data_perc['Single'], data_perc['Couple'], data_perc['Group']],
labels=['Single', 'Couple', 'Group'],
alpha=0.85)
myFmt = mdates.DateFormatter('%b-%Y')
ax.xaxis.set_major_formatter(myFmt)
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
ax.set_xlabel('Month of Transaction')
ax.set_ylabel('Percent of Total Transaction')
legend = ax.legend(frameon = 1, loc='upper left', fontsize='x-large')
frame = legend.get_frame()
frame.set_color('white')
frame.set_edgecolor('black')
printmd("<a id='fig7'>**Figure 7**</a>. Proportion of passenger count "
        "from January 1, 2019 to June 30, 2020.")
plt.show();
Single passengers accounted for the largest proportion of all taxi transactions, and this share increased further with lockdown measures in place. The proportion of single passengers dropped slightly with the phased reopening of the NYC economy. The proportion of couple passengers decreased, then gradually increased as lockdown measures eased in June 2020. Group passengers consistently took the smallest share of all transactions, and this share shrank further because of the pandemic. These observations, together with the significant drop in ridership, indicate that social distancing in taxi rides was followed to the fullest extent.
def agg_monthly_pu_do(df, year, extract='PULocationID'):
    """Return a pandas data frame aggregated daily by pick-up/drop-off zone"""
    df = df.groupby(['month', 'day',
                     extract])['trip'].sum().compute().reset_index()
    df['date'] = pd.to_datetime(year + '/' +
                                df.month.astype(str) + '/'
                                + df.day.astype(str))
    return df
# pd.concat is used because DataFrame.append was removed in pandas 2.0
viz = pd.concat([agg_monthly_pu_do(df_agg_daily_2019, '2019'),
                 agg_monthly_pu_do(df_agg_daily_2020, '2020')])
# Open shape file
with zipfile.ZipFile('NYC Taxi Zones.zip', 'r') as zip_ref:
zip_ref.extractall('./shp')
fp = './shp/geo_export_3b78f57d-632d-46ba-9dff-906b6c40ea39.shp'
df_shp_zones = gpd.read_file(fp)[['location_i', 'zone', 'geometry']]
viz = pd.merge(viz, df_shp_zones,
left_on='PULocationID', right_on='location_i', how='left')
viz['trip'] = np.log(viz['trip'])
max_colour = max(viz['trip'])
min_colour = min(viz['trip'])
cmap = cm.linear.YlOrRd_09.scale(min_colour, max_colour)
viz['colour'] = viz['trip'].map(cmap)
viz['date_sec'] = viz['date'].astype(int) / 10**9
viz['date_sec'] = viz['date_sec'].astype(int).astype(str)
# Prepare data for TimeSlider Choropleth
zone_list = viz['zone'].unique().tolist()
zone_idx = range(len(zone_list))
style_dict = {}
for i in zone_idx:
    zone = zone_list[i]
    result = viz[viz['zone'] == zone]
    inner_dict = {}
    for _, r in result.iterrows():
        inner_dict[r['date_sec']] = {'color': r['colour'], 'opacity': 0.7}
    style_dict[str(i)] = inner_dict
zones_df = viz[['geometry']]
zones_gdf = gpd.GeoDataFrame(zones_df)
zones_gdf = zones_gdf.drop_duplicates().reset_index()
main = folium.Map([40.74, -73.96], width='100%', height='80%',
tiles='cartodbpositron', zoom_start=10)
TimeSliderChoropleth(data=zones_gdf.to_json(),
styledict=style_dict).add_to(main)
cmap.add_to(main)
cmap.caption = "Log of Number of Transaction"
printmd("<a id='fig8'>**Figure 8**</a>. Daily number of transactions per zone"
        " by pickup from January 1, 2019 to June 30, 2020.")
main
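The style dictionary passed to `TimeSliderChoropleth` is nested: the outer key is the feature ID as a string, the inner key is the date as a string of seconds since the Unix epoch, and the value holds the colour and opacity. The sketch below builds that structure with the standard library only; the records and hex colours are made up, and `epoch_key` and `build_style_dict` are hypothetical helper names.

```python
from datetime import date, datetime, timezone


def epoch_key(day):
    """Return midnight UTC of `day` as a string of seconds since the epoch."""
    midnight = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    return str(int(midnight.timestamp()))


def build_style_dict(records, opacity=0.7):
    """Build {feature_id: {epoch_seconds: {'color', 'opacity'}}}.

    `records` is an iterable of (feature_id, date, hex_colour) tuples.
    """
    style = {}
    for feature_id, day, colour in records:
        per_feature = style.setdefault(str(feature_id), {})
        per_feature[epoch_key(day)] = {"color": colour, "opacity": opacity}
    return style


# Hypothetical records: two zones on two days, with made-up colours.
records = [
    (0, date(2020, 3, 1), "#ffffcc"),
    (0, date(2020, 3, 2), "#fd8d3c"),
    (1, date(2020, 3, 1), "#800026"),
]
style_dict_example = build_style_dict(records)
print(sorted(style_dict_example["0"]))
```

Treating naive dates as UTC mirrors what the integer cast of the pandas `date` column does in the cell above, so the keys line up with the time slider's steps.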